Architecture for distributed, parallel crawling of interactive client-server applications

ABSTRACT

In one embodiment, a distributed computing system includes a first worker node configured to execute a first job, a second worker node configured to execute a second job, and a master node including a processor coupled to a memory. The first job indicates a first portion of an interactive client-server application to be crawled. The second job indicates a second portion of the interactive client-server application to be crawled. The first worker node and the second worker node are configured to execute their respective jobs in parallel. The master node is configured to assign the first job to the first worker node, assign the second job to the second worker node, and integrate the results from the first worker node and the second worker node into a record of operation of the application.

RELATED APPLICATION

This application claims the benefit under 35 U.S.C. §119(e) of U.S. Provisional Application Ser. No. 61/408,191 filed Oct. 29, 2010, entitled “METHOD AND SYSTEM FOR PARALLEL CRAWLING OF DYNAMIC WEB APPLICATIONS IN A DISTRIBUTED COMPUTING ENVIRONMENT”.

TECHNICAL FIELD

The present invention generally relates to interactive client-server applications and, more particularly, to distributed, parallel crawling of interactive client-server applications.

BACKGROUND

Modern Web 2.0 applications employ technologies, such as AJAX and Flash, in order to present a rich, dynamic and interactive interface to the user. However, conventional validation techniques, based on manual testing, are completely inadequate at capturing or exploring the rich, stateful behavior of such web applications. Some recent research has proposed the use of custom AJAX web application crawlers to comprehensively explore, capture and validate the behavior of Dynamic Web 2.0 Applications. However, such crawling is typically very computationally intensive and hence practical considerations limit the actual crawling to only a fraction of the web applications' true behavior-space.

SUMMARY

In one embodiment, a distributed computing system includes a first worker node configured to execute a first job, a second worker node configured to execute a second job, and a master node including a processor coupled to a memory. The first job indicates a first portion of an interactive client-server application to be crawled. The second job indicates a second portion of the interactive client-server application to be crawled. The first worker node and the second worker node are configured to execute their respective jobs in parallel. The master node is configured to assign the first job to the first worker node, assign the second job to the second worker node, and integrate the results from the first worker node and the second worker node into a record of operation of the application.

In another embodiment, a method for verifying an interactive client-server application includes selecting and assigning a first job indicating a first portion of an interactive client-server application to be crawled, selecting and assigning a second job indicating a second portion of the application to be crawled, executing the first job and executing the second job in parallel, and integrating partial results from the first job and the second job into a record of operation of the interactive client-server application.

In yet another embodiment, an article of manufacture includes a computer readable medium and computer-executable instructions carried on the computer readable medium. The instructions are readable by a processor. The instructions, when read and executed, cause the processor to select and assign a first job indicating a first portion of an interactive client-server application to be crawled, select and assign a second job indicating a second portion of the interactive client-server application to be crawled, execute the first job and the second job in parallel, and integrate partial results from the first job and the second job into a record of operation of the application.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention and its features and advantages, reference is now made to the following description, taken in conjunction with the accompanying drawings, in which:

FIG. 1 is an example embodiment of a distributed computing system configured to provide a service for parallel crawling of one or more interactive client-server applications;

FIG. 2 is an example embodiment of an architecture for distributed, parallel crawling of interactive client-server applications, including a master node and one or more worker nodes;

FIG. 3 illustrates an example embodiment of the operation of an architecture for distributed, parallel crawling of dynamic web applications;

FIG. 4 illustrates the result of the operation of an example worker node through the illustration of a transition graph model;

FIG. 5 is a screen transition graph of an example dynamic web application that may be crawled by the distributed computing system;

FIG. 6a illustrates how an empty screen transition graph may be combined with a returned trace from a worker node;

FIG. 6b illustrates how the master node may add the results of another worker node to the existing master screen transition graph resulting from the previous figure;

FIG. 6c illustrates how the master node may add the results of yet another worker node to the existing master screen transition graph resulting from the previous figure;

FIG. 7 is an example of a marked version of the document object model tree of a screen of a dynamic web application that has been at least partially crawled;

FIGS. 8a and 8b are an example embodiment of a method for coordinating the distributed, parallel crawling of interactive client-server applications such as dynamic web applications;

FIG. 9 is an example embodiment of a method for efficient partial crawling of interactive client-server applications such as dynamic web applications in a parallel, distributed environment;

FIG. 10 is an example embodiment of a method for synchronizing a state graph created from crawling a portion of an interactive client-server application with a master state graph of the application;

FIG. 11 is an example embodiment of a method for compression of state information in the crawling of interactive client-server applications such as dynamic web applications; and

FIG. 12 is an example embodiment of a method for marking the changes between a screen and a reference screen.

DETAILED DESCRIPTION

FIG. 1 is an example embodiment of a distributed computing system 100. In one embodiment, the distributed computing system 100 may be configured to provide a service for parallel crawling of one or more interactive client-server applications. In one embodiment, such interactive client-server applications may include web applications 104. Such web applications 104 may include dynamic web applications. Web applications 104 may be subsequently tested, once they have been crawled to determine their operation and scope.

The distributed computing system 100 may include any distributed computing environment 106 including multiple, networked computing resources. Such computing resources may be heterogeneous. In various embodiments, the connection topology of the computing resources may be unknown or irregular such that the service being implemented in the distributed computing system 100 cannot take advantage of specific topologies in order to execute the computation task at hand.

In one embodiment, the distributed computing system 100 may be implemented in a cloud computing framework or environment. The distributed computing system 100 may be implemented by one or more computing nodes. One such computing node may be designated as a master node 110, and other computing nodes may be designated as worker nodes 112. The worker nodes 112 and/or master node 110 may be implemented in any suitable electronic device, including but not limited to, a server, computer, or any aggregation thereof. The worker nodes 112 and master node 110 may include a processor coupled to a memory, and instructions, which when loaded in the memory for execution by the processor, may carry out the functionality described herein. The worker nodes 112 and master node 110 may be communicatively coupled to each other, such as through a network arrangement. The network arrangement may be heterogeneous or homogeneous, and may be provided by distributed computing environment 106. Any suitable network arrangement may be used to communicatively couple the worker nodes 112 and master node 110. The worker nodes 112 and master node 110 of the distributed computing system 100 may be networked in any suitable network, such as a wide area network, a local area network, an intranet, the Internet, or any combination of these elements.

The worker nodes 112 and/or master node 110 may be configured to share computational loads associated with a task to be accomplished in a parallel fashion. For example, worker nodes 112 may work in parallel to test the one or more web applications 104. Such web applications may be operating on or hosted by one or more websites. To accomplish such a test, the worker nodes 112 and/or master node 110 may be communicatively coupled to the web applications 104. The master node 110 may be communicatively coupled to the web application 104, and configured to organize the operation of other worker nodes 112 to test the web application 104.

As part of testing the one or more dynamic web applications 104, the worker nodes 112 and master node 110 may operate a web application crawling service. For example, developers of web applications 104 may place such web applications 104 under test, wherein the worker nodes 112 and/or master node 110 of the distributed computing system 100 may crawl such dynamic web applications 104 to determine their scope and operation, which may be used in such tests. Such web applications may include web 2.0 applications using technologies such as AJAX, Flash, or other technologies configured to provide rich, dynamic and interactive user experiences. Such dynamic web applications may have stateful behavior and possibly infinite numbers of dynamically generated screens. Such behavior may be stateful in that a given generated screen or web page may depend, in content or operation, upon the specific actions which brought about the loading, operation, or creation of the screen or web page.

The distributed computing system 100 may include middleware running on each of worker nodes 112 and master node 110. Such middleware may be implemented as software that interfaces the master node 110 with each of worker nodes 112. The middleware may be configured to enable the parallelization of computing tasks. Communication between worker nodes 112 and master node 110 may be very expensive in terms of time or network or processing resources. Thus, the middleware of the distributed computing system 100 may minimize communication between the worker nodes 112 and master node 110.

The computational resources of the distributed computing system 100 may be configured to be leveraged by crawling the dynamic web applications 104. The distributed computing system 100 may be configured to parallelize and distribute the crawlings to multiple computing nodes. Consequently, the crawlings should be made conducive to parallelization. The distributed computing system 100 may be configured to conduct the parallelization of the crawlings in a manner that is independent of topology or architecture. In some embodiments, the nodes of the distributed computing system 100 may have arbitrary connection topology which may be hidden from an application organizing the worker nodes 112 and/or master node 110 for parallel crawling of dynamic web applications 104. The distributed computing system 100 may be configured to minimize communication between computing nodes 110, 112, as such nodes may be physically distant from each other, resulting in expensive communication. The worker nodes 112 may be configured to return results of crawling, including states, transitions, and new jobs. The distributed computing system 100 may be configured to re-integrate the results of crawlings from the various worker nodes 112 in the cloud or distributed computing system 100 through the operation of the master node 110.

FIG. 2 is an example embodiment of an architecture for distributed, parallel crawling of interactive client-server applications, including a master node 110 and one or more worker nodes 112. Master node 110 may be communicatively coupled to a worker node 112, and each may be communicatively coupled to one or more web applications 104 to dynamically crawl the web application 104. More worker nodes may be coupled to the master node 110 and the web application 104, but are not shown. Worker node 112 and master node 110 may be communicatively coupled through a network 230. Network 230 may be embodied in the networks or cloud of distributed computing environment 106 of FIG. 1. Worker node 112 may be configured to crawl web application 104 in parallel with other worker nodes, under direction from master node 110.

Master node 110 may include a processor 208 coupled to a memory 206. Master node 110 may include a master crawler application 220. Master crawler application 220 may be configured to be executed by processor 208 and reside in memory 206. Master node 110 may be communicatively coupled to web application 104 and worker node 112 through master crawler application 220.

Master node 110 may include a job queue 232, representing pending jobs which are to be crawled. A job may contain a description of a part of a web application 104 that is to be crawled. Master node 110 may contain a resource queue 234, indicating worker nodes 112 which are available to be assigned crawl job assignments. Examples of the population of resource queue 234 and job queue 232 are discussed below. Crawl jobs may include an indication of a portion of a web application 104 that is to be explored by a worker node 112. The master node 110 may also keep a copy of a master state graph 236, which may be the master copy of a screen transition graph model of the web application 104, and which may contain the result of crawling the web applications 104.
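
As a minimal, non-limiting sketch of the bookkeeping just described, the job queue 232, resource queue 234, and master state graph 236 could be held in simple in-memory containers. The names `CrawlJob`, `MasterNode`, and `assign_next`, and the representation of a job as a trace of actions, are assumptions made for illustration and are not taken from the embodiments; the first-in, first-out pairing shown is only one possible policy.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CrawlJob:
    # Hypothetical job description: the actions leading from the start
    # page to the part of the application that is to be crawled.
    init_trace: tuple


@dataclass
class MasterNode:
    job_queue: deque = field(default_factory=deque)       # pending crawl jobs (232)
    resource_queue: deque = field(default_factory=deque)  # idle worker ids (234)
    state_graph: dict = field(default_factory=dict)       # state -> {action: next state} (236)

    def assign_next(self):
        """Pair the oldest idle worker with the oldest pending job, or return None."""
        if not self.job_queue or not self.resource_queue:
            return None
        return self.resource_queue.popleft(), self.job_queue.popleft()
```

Calling `assign_next()` repeatedly drains both queues in arrival order; a worker would be appended to `resource_queue` again once it reports completion.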

Worker node 112 may include a processor 212 coupled to a memory 210. Worker node 112 may include a worker crawler application 218. Worker crawler application 218 may be configured to be executed by processor 212 and reside in memory 210. Worker node 112 may be communicatively coupled to web applications 104 and master crawler application 220 through worker crawler application 218.

The processors 208, 212 of the nodes may comprise, for example, a microprocessor, microcontroller, digital signal processor (DSP), application specific integrated circuit (ASIC), or any other digital or analog circuitry configured to interpret and/or execute program instructions and/or process data. The processors 208, 212 may interpret and/or execute program instructions and/or process data stored in the respective memories 206, 210 of the worker nodes 112 and/or master node 110. The memories 206, 210 may comprise any system, device, or apparatus configured to retain program instructions and/or data for a period of time (e.g., computer-readable media).

Master node 110 and worker node 112 may be configured to crawl web applications 104. Some or all portions of the web applications 104 may be viewed, executed or analyzed by master node 110 and worker node 112. Each crawler application 218, 220 may contain data 222, 224 pertaining to a portion of the web application 104. Such data 222, 224 may include information enabling communication with or use of the web application 104. For example, data 222, 224 may include document object models, resource information, or web application version. Each crawler application 218, 220 may include a browser application 226, 228, which may be implemented as part of worker crawler application 218 or master crawler application 220. Browser application 226, 228 may be implemented in any suitable application for loading content from a web application. Browser application 226, 228 may be implemented as a web client. The browser applications 226, 228 may alternatively be configured to work in concert with the crawler applications 218, 220, if the browsers 226, 228 are not implemented in them. In one embodiment, the crawler applications 218, 220 may include FLA-Spider. The crawler applications 218, 220 may be implemented in the Java language. The crawler applications 218, 220 may operate in concert with the browser applications 226, 228. The crawler application 218, 220 may be configured to navigate a web application 104 and programmatically perform various operations such as clicking, mouse over, data entry, or any other operation that may simulate or reproduce the action of a user of a web application 104. The crawler applications 218, 220 may be configured to explore the possible operations of a web application 104, given different user inputs applied to the web application 104.

The crawler applications 218, 220 running on each node may be configured to produce a screen transition graph which may model the behavior of the web application 104 as the web application 104 is crawled, tested, and used. An example screen transition model may be found in FIG. 5, which is discussed in further detail below. In such a screen transition graph, dots or nodes may be used to represent states, where the state denotes screens observed on the browser. Thus, a screen transition graph may be a state graph of an interactive client-server application. Transitions between states may denote various possible user actions. For example, a button click may cause a web application in one state to jump to a different state, wherein the available operations for the web application have changed. Given such a screen transition model, validation checks may be performed subsequently on the model to verify desired operation, or other diagnostic actions.
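
For illustration only, a screen transition graph of this kind might be encoded as a mapping from each state to its outgoing user actions, with a simple reachability query standing in for a validation check performed on the model. The state names, action labels, and the `reachable` helper below are hypothetical and are not drawn from FIG. 5.

```python
from collections import deque

# Hypothetical screen transition graph: state -> {user action -> next state}.
graph = {
    "S0": {"click:menu": "S1", "click:help": "S2"},
    "S1": {"click:back": "S0"},
    "S2": {},
}


def reachable(graph, start):
    """Breadth-first search over transitions; returns the set of reachable states."""
    seen, frontier = {start}, deque([start])
    while frontier:
        state = frontier.popleft()
        for nxt in graph.get(state, {}).values():
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen
```

A check such as "every screen can be reached from the start screen" then reduces to comparing `reachable(graph, "S0")` with the full set of states.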

Crawling information to be used by a crawling application may be provided to each instance of the crawling application, such as worker crawler application 218, so that the distributed computing system 100 may provide parallel crawlings of web applications 104 under test. For example, a crawl specification and/or crawl data may be provided to the crawling application 218. The crawl specification may indicate the form of the web application 104, the expected behavior of the web application 104, or any other suitable information about using the web application 104. The crawl data may include actions to be taken by the browser 226, data 222 to be entered, or any other information indicating an action to be taken. For example, for a given page as defined by the crawl specifications, crawl data may indicate that any number of mouse-overs are to be conducted on various specific elements of the web application 104.

Master crawler application 220 may be configured to coordinate the crawling of worker node 112 and other worker nodes 112 in a distributed computing system. Master crawler application 220, in combination with the various instances of worker crawler application 218, may be configured to serve as the middleware of distributed computing system 100 as described above. Master crawler application 220 may be configured to perform some or all of the functions of the master node 110 related to crawling web applications 104. Worker crawler application 218 may be configured to perform some or all of the functions of the worker node 112 related to crawling web applications 104. In various embodiments, the functionality of master crawler application 220 and worker crawler application 218 may be divided differently depending upon the requirements of crawling web applications 104.

FIG. 3 shows an example of the operation of various nodes within the distributed computing system 100. FIG. 3 may illustrate an example embodiment of the operation of an architecture for distributed, parallel crawling of interactive client-server applications. The distributed computing system 100 may include as many worker nodes 112 as are available for the tasks described herein. The master node 110 may issue commands to worker nodes 112, which in turn may provide status information as well as results to the master node 110.

The master node 110 may issue commands to worker nodes 112, such as crawl job assignments, wherein specific worker nodes 112 from resource queue 234 are assigned specific jobs originating from the job queue 232. Worker nodes 112 may communicate their status as well as crawling results back to the master node 110. Such information may include the completion status of various crawl jobs that have been assigned to the worker nodes 112. This information may also include partial results from such crawl jobs. Such information may also include new crawl jobs which have been discovered by the worker node 112. Worker nodes 112 may be configured to discover new crawl jobs by determining unused actions in states of the web application 104. Such actions may be unused because an alternative action was chosen instead. The new crawl jobs may comprise a starting position for crawling the web application, wherein the crawling may utilize a previously unused action. The master node 110 may be configured to merge the results received from worker nodes 112 into the master state graph 236.

As described above, each worker node 112 may have a copy of some or all of the crawler application as well as crawl configuration information. The worker nodes 112 may perform an assigned crawling task, generate new crawling jobs discovered while crawling, and report back the crawling results and generated jobs to the master node 110. New crawling jobs may include additional portions or options of the dynamic web application 104 to be explored, which are discovered as the worker node 112 conducts crawling activity.

Distributed computing system 100 may be configured to utilize a synchronization scheme for distributed, parallel crawling of dynamic web applications. Such a scheme may enable the synchronization of information regarding the results of crawling a web application 104, such as master state graph 236, between the master node 110 and worker nodes 112. As part of such a scheme, the master node 110 and worker nodes 112 may be configured to reduce communication overhead between such entities for synchronizing information such as the master state graph 236. Worker nodes 112 may be configured to continue to crawl their portions of the dynamic web application independently. Worker nodes 112 may be configured to provide information about the state graph as seen from the perspective of the worker nodes 112 periodically to the master node 110. Such information may include a partial state graph. Each worker node 112 may not have the full master state graph 236 as seen by the master node 110. Instead, each worker node 112 may have a partial state graph reflecting portions of the web application 104 that the worker node 112 was initialized with, in addition to new portions of the web application 104 that the worker node 112 has discovered while crawling the web application 104. Such a partial state graph may include information such as newly discovered states, transitions, or jobs. The partial state graph may contain information discovered since a previous synchronization was conducted. The worker node 112 may select between transmitting partial state graphs and/or newly discovered jobs on a periodic basis, transmitting partial state graphs and/or newly discovered jobs upon completion of a crawling job, or transmitting partial state graphs and/or newly discovered jobs as they are discovered. Such a selection may be made based on operating parameters provided by master node 110. In addition, worker nodes 112 may be configured to compress sets of such states before transmitting them to the master node 110.
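
One way to picture a partial state graph that contains only information discovered since the previous synchronization is a worker-side buffer that is emptied each time it is sent. The sketch below is an assumption-laden illustration: the class name `WorkerSync` and its methods are hypothetical, compression is omitted, and the choice of when to call `flush` (periodically, on job completion, or on each discovery) is left to the caller.

```python
class WorkerSync:
    """Hypothetical worker-side record of crawl results awaiting synchronization."""

    def __init__(self):
        self.local_graph = {}  # everything this worker has seen
        self._pending = {}     # transitions discovered since the last sync

    def record(self, state, action, next_state):
        """Store a newly observed transition locally and mark it as unsent."""
        self.local_graph.setdefault(state, {})[action] = next_state
        self._pending.setdefault(state, {})[action] = next_state

    def flush(self):
        """Return the partial state graph to transmit and clear the unsent buffer."""
        delta, self._pending = self._pending, {}
        return delta
```

Because only the delta is returned, repeated synchronizations do not resend transitions the master node has already received.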

The master node 110 may be responsible for purging any duplication of work observed between different worker nodes 112. Such duplication may be observed by the master node 110 comparing the results received from worker nodes 112, wherein such results may include partial state graphs. The master node 110 may be configured to remove duplicate states and traces showing the operation of the web application 104 while merging data received from the various worker nodes 112. The master node 110 may be configured to purge duplicate jobs in the job queue 232 wherein such jobs represent portions of the dynamic web application 104 that have already been crawled. The master node 110 may also be configured to send purge signals to worker nodes 112, wherein the worker nodes 112 are instructed to stop working on jobs that have been determined by the master node 110 as duplicates. Such duplicate jobs may have been assigned already to other worker nodes 112, which are likely presently executing such jobs, or may have already finished. Such purge signals may be based on a record kept by the master node 110 of which jobs have been assigned to which worker nodes 112, as well as an indication of the scope of such a job.

The master node 110 may be configured to schedule jobs from the job queue 232 to worker nodes 112 in the resource queue 234. The master node 110 may be configured to make such scheduling on any suitable basis. In one embodiment, the master node 110 may be configured to schedule jobs from the job queue 232 to worker nodes 112 in the resource queue 234 on a first-in, first-out basis. In another embodiment, the master node 110 may select jobs from the job queue 232, and worker nodes 112 from the resource queue 234, by determining the best match among the jobs or resources. In such an embodiment, matches may be determined on a best-first basis.

Using a best-first basis, the master node 110 may choose the best candidate job to schedule, from the job queue 232, and choose the best resource to schedule it on among the available resources in the resource queue 234. The selection of the best candidate job may be based on any suitable factor. In one embodiment, a time-stamp of the job may be used as a factor in selecting the best candidate job. In such an embodiment, earlier time-stamped jobs may get a higher preference. In another embodiment, the length of the initialization trace for the job may be used as a factor in selecting the best candidate job. In such an embodiment, jobs with smaller initialization traces may have lower initialization costs and may thus be preferred, depending upon the available resources.

The selection of the best candidate resource from the resource queue 234 may be based on any suitable factor. In one embodiment, an insertion time-stamp of the resource may be used as a factor in selecting the best candidate resource. In such an embodiment, earlier time-stamped resources may get a higher preference, so as to maximize the resource's utilization. In another embodiment, computation strength of the resource may be used as a factor in selecting the best candidate resource. In such an embodiment, the computing power of the resource may be used to match it to an appropriately-sized job. In yet another embodiment, communication overhead of the resource may be used as a factor in selecting the best candidate resource. In such an embodiment, if information is known about the connection topology of the resource to the master node 110, the information can be used to give preference to resources with more efficient, shorter, or faster communication with the master node 110. Such information may be determined by the statistical results of worker nodes 112 completing tasks.

To determine either the best candidate resource or the best candidate job, a function, for example, a weighted sum of the factors described above, may be employed to determine the best candidate. Such weighted sums may be used as cost functions for choosing the best candidate. In such a case, if time-stamps of the jobs and the resources are used as the sole criterion for choosing jobs and resources, the scheme reduces to a first-in, first-out mechanism typical of a basic queue data structure.
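
A weighted-sum cost function for job selection might be sketched as follows. The field names, the default weights, and the convention that a lower cost marks a better candidate are all assumptions chosen for illustration; the embodiments do not prescribe particular weights. Only the two job factors named above (time-stamp and initialization trace length) are modeled.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    timestamp: float   # when the job entered the job queue
    trace_length: int  # number of actions in its initialization trace


def job_cost(job, w_time=1.0, w_trace=0.0):
    """Weighted sum of the factors; lower cost means a better candidate."""
    return w_time * job.timestamp + w_trace * job.trace_length


def best_job(jobs, **weights):
    """Pick the candidate job with the lowest weighted cost."""
    return min(jobs, key=lambda j: job_cost(j, **weights))
```

With the default weights only the time-stamp matters, so `best_job` returns the oldest job and the selection behaves like a first-in, first-out queue; shifting weight onto `w_trace` instead favors jobs that are cheaper to initialize. A corresponding function for resources could be written in the same way.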

The master node 110 may be configured to integrate traces and states received from the worker nodes 112 into the master state graph. Worker nodes 112 may provide completed computations representing a sub-tree or a trace of the behavior of the web application that has been completed and crawled. The master node 110 may also receive indications of new computations as determined by one or more worker nodes 112. Upon reception of traces and states from the worker nodes 112, the master node 110 may be configured to check to determine whether duplicates exist in the received state or traces as compared to information already determined in the master state graph, or as compared to states in jobs that have been assigned to other worker nodes 112. If such duplicates are detected, the master node 110 may be configured to purge duplicate jobs from the job queue 232. The master node 110 may also be configured to purge duplicate crawls currently executing on worker nodes 112 by issuing a purge command. The master node 110 may also be configured to merge the received information with the information in the master state graph, removing duplicates.
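
The merge and purge steps can be sketched under simplifying assumptions: states are compared by equality of a hypothetical identifier, graphs map each state to its explored actions, and a queued job is a (state, action) pair awaiting exploration. The function name `merge_partial` is illustrative, and purge commands to running workers are not modeled.

```python
def merge_partial(master_graph, job_queue, partial_graph):
    """Merge a worker's partial graph into the master graph.

    Returns the set of states that were new to the master graph, and
    removes queued jobs whose (state, action) pair has now been crawled.
    """
    new_states = set()
    for state, transitions in partial_graph.items():
        if state not in master_graph:
            new_states.add(state)
            master_graph[state] = {}
        # Transitions already known are simply overwritten, so duplicates
        # never appear twice in the merged graph.
        master_graph[state].update(transitions)
    # Purge duplicate jobs: keep only jobs whose action is still unexplored.
    job_queue[:] = [
        (s, a) for (s, a) in job_queue if a not in master_graph.get(s, {})
    ]
    return new_states
```

The returned set gives the master node a cheap way to tell whether a worker's report contributed anything beyond what was already recorded.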

FIG. 4 illustrates the result of the operation of an example worker node 112 through the illustration of a transition graph model 402. As described above, a worker node 112 may be configured to run a copy of the crawling application. The worker node 112 may also contain appropriate crawling settings for a web application 104 to be tested. The worker node 112 may be configured to initialize its operation with a partial trace 404 provided by the master node 110. Such a partial trace 404 may be an alternative to the worker node 112 initializing its operations with a full copy of the master state graph 236. However, such an initialization with the master state graph 236 may cost more in terms of communication between the worker node 112 and the master node 110. Such a partial trace 404 may include a description of the actions that must be taken from the web application start page 406, such as index.jsp, in order to reach a particular state such as S₀ within the master state graph, wherein the particular state is to be crawled by the worker node 112 as part of the job that was assigned to it by the master node 110. The worker node 112 may be configured to continue crawling from S₀ and its children states, such as S₁, by examining different branches and actions and storing other information as new jobs. The worker node 112 may reach a point in the crawling of the job in which the crawling of the trace will terminate, even though the job has not been completed. Such cases are discussed below.

In another example, if a worker node 112 was given a particular page inside of a dynamic web application to crawl, and was presented with a choice of menu items to be selected on such a page, the worker node 112 may be configured to select the first choice in the menu and explore the subsequent operation of the dynamic web application, and store the states or actions representing the remaining unselected menu choices as future jobs. As the worker node 112 crawls the portions of the dynamic web application to which it was assigned, it may create a local state graph representing the states encountered and the actions taken to reach such states. The worker node 112 may be configured to terminate crawling if it reaches a state which it has seen before. Such a state may include a state which is present in the local state graph. The worker node 112 may be configured to terminate crawling if the crawling hits a depth limit or time limit as set by the crawling specification. For example, along a particular path if a worker node 112 hits a depth of ten subsequent actions, the worker node 112 may terminate its crawling. In addition, the worker node 112 may be configured to terminate crawling if it receives a purge command from the master node 110.
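
The worker behavior described for FIG. 4 and in the example above can be sketched as a single function. This is a simplified illustration, not the crawler of the embodiments: the dictionary `app` is a hypothetical stand-in for the live web application, the function name `crawl_job` is invented here, a new job is represented as the trace of actions that ends in the unused action, and the time limit and purge command are not modeled.

```python
def crawl_job(app, start_page, init_trace, depth_limit=10):
    """Replay the initialization trace, then crawl along the first action of
    each state, recording the unselected actions as new jobs.

    app: hypothetical stand-in for the application, state -> {action: next state}.
    Returns (local_graph, new_jobs).
    """
    # Replay the partial trace from the start page to reach the assigned state.
    state = start_page
    for action in init_trace:
        state = app[state][action]

    local_graph, new_jobs = {}, []
    trace, depth = list(init_trace), 0
    # Stop on a state already in the local graph or when the depth limit is hit.
    while state not in local_graph and depth < depth_limit:
        actions = app[state]
        local_graph[state] = {}
        if not actions:
            break  # no actions available on this screen
        first, *rest = actions  # dict iteration yields actions in insertion order
        for unused in rest:
            new_jobs.append(tuple(trace + [unused]))  # future job for the master
        next_state = actions[first]
        local_graph[state][first] = next_state
        trace.append(first)
        state = next_state
        depth += 1
    return local_graph, new_jobs
```

For an application in which the state reached by the trace offers three menu items and the first item leads to a screen that links back, the function returns a two-state local graph and two new jobs, one per unselected menu item.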

The worker node 112 may be configured to periodically transmit information, including information about new states, new traces representing decision paths taken in the web application, and new jobs, to the master node 110. The period of such a transmittal may be set statically or dynamically based on communication and computation tradeoffs as determined by the distributed computing system 100. The period chosen for a given distributed computing system may depend upon the resources of the distributed computing system, the nature of the dynamic web application to be tested, or other unforeseen factors. Specific or optimal values of the period may be determined experimentally. Upon termination, the worker node 112 may be configured to register itself with the resource queue 234 available in the master node 110.

The distributed computing system 100 may be configured to utilize a technique for stateless distributed parallel crawling of dynamic web applications. In one embodiment, the distributed computing system 100 may be configured to select between conducting stateless parallelization or stateful parallelization of crawling. A stateful parallelization of crawling may include the steps described herein in which states returned from worker nodes 112 are compared at the master node 110 against the master state graph to search for duplicates. A stateless parallelization of crawling may cause the master node 110 to not seek to eliminate such duplicates, and the resulting master state graph may not indicate that a state appearing lower in the execution tree is also a duplicate of a higher-appearing state. A stateful parallelization scheme may be more useful when the underlying state graph has significant state sharing, state reconvergence and cycles. The distributed computing system 100 may be configured to use stateless parallelization if little reconvergence exists in the state graph of a given dynamic web application; for example, if the state graph has largely a tree-like structure. When stateless parallelization is employed by the distributed computing system 100, the master node 110 and worker nodes 112 may omit state comparisons. Such an omission of state comparisons may speed up the operation of master node 110, as state graph merging may be accomplished with fewer resources. The purging operations otherwise required of master node 110 may be eliminated, depending upon the status of stateless parallelization. Similarly, the omission may speed up the crawling operation at the worker nodes 112. Further, worker nodes 112 may be configured to transmit results only once at the end of computation when using stateless parallelization. However, the resulting master state graph may contain states which appear in multiple positions.
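The difference between the two merging modes may be sketched as follows. This is only an illustration under simplifying assumptions: states are reduced to string labels, and the function names merge_stateful and merge_stateless are hypothetical rather than taken from the described system.

```python
from typing import Iterable, List


def merge_stateful(master: List[str], returned: Iterable[str]) -> List[str]:
    """Compare each returned state against the master list and skip duplicates."""
    merged = list(master)
    known = set(master)
    for state in returned:
        if state not in known:
            known.add(state)
            merged.append(state)
    return merged


def merge_stateless(master: List[str], returned: Iterable[str]) -> List[str]:
    """Append returned states without any comparison (assumed stateless mode)."""
    return list(master) + list(returned)
```

With a master list of S1 and S2 and returned states S2 and S3, the stateful merge keeps a single S2, while the stateless merge leaves S2 in two positions, which is the cost accepted in exchange for omitting comparisons.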

Worker nodes 112 may be configured to compress the state of their operation and of newly discovered jobs through any suitable means. In one embodiment, worker nodes 112 may be configured to use such state compression when successive pages of a dynamic web application represent states which differ only slightly from a previous state. For example, a given user action on a given screen of an AJAX-built web application may result in changes or updates to only a small part of the current screen. The new screen thus obtained differs only slightly in its content from the previous screen. Thus, the worker node 112 may be configured to only store the differences between the document object models of successive states of a dynamic web application, which can then be transmitted to the master node 110 and decompressed by the master node 110 to obtain the full representation of the respective states. State compression may be enabled when the difference between successive states is lower than a given threshold. Such a threshold may be set in terms of relative or absolute differences between successive states of the dynamic web application. Worker nodes 112 may be configured to enable and disable state compression depending upon the particular dynamic web application pages that are being crawled presently.
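A threshold-gated difference encoding of this kind may be sketched as below. The sketch assumes a deliberately flat model of a screen (a mapping from element path to serialized content), a relative threshold of 0.5, and hypothetical function names; it also ignores elements removed between screens, which a fuller implementation would need to handle.

```python
from typing import Dict, Tuple

Dom = Dict[str, str]  # assumed flat model: element path -> serialized content


def diff(prev: Dom, curr: Dom) -> Dom:
    """Entries of curr that are new or changed relative to prev."""
    return {path: value for path, value in curr.items() if prev.get(path) != value}


def encode(prev: Dom, curr: Dom, threshold: float = 0.5) -> Tuple[str, Dom]:
    """Send only the delta when the relative difference is below the threshold."""
    delta = diff(prev, curr)
    ratio = len(delta) / max(len(curr), 1)
    if ratio < threshold:
        return ("delta", delta)
    return ("full", dict(curr))


def decode(prev: Dom, kind: str, payload: Dom) -> Dom:
    """Rebuild the full state on the receiving side."""
    return {**prev, **payload} if kind == "delta" else dict(payload)
```

When only one of three elements changes, encode returns a one-entry delta, and decode applied to the previous state recovers the full new state.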

Distributed computing system 100 may be configured to crawl any suitable dynamic web application. FIG. 5 is a screen transition graph of an example dynamic web application 500 that may be crawled by distributed computing system 100. The screen transition graph may contain a state graph. Dynamic web application 500 may be configured to display two buttons, Button1 and Button2. The appearance and functionality associated with Button1 and Button2 may depend upon various previous actions from a user. The different states in which the dynamic web application 500 may exist are represented by S1, S2, S3, and S4. The screen transition graph of FIG. 5 may fully represent the possible states of dynamic web application 500. Thus, the screen transition graph of FIG. 5 may be the completed result of dynamically crawling dynamic web application 500.

The code for dynamic web application 500 may be embodied by the following:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
  <script type="text/javascript">
    function toggle1() {
      if (document.getElementById("button1").value == "Click Me!")
        document.getElementById("button1").value = "I'm clicked";
      else
        document.getElementById("button1").value = "Click Me!";
    }
    function toggle2() {
      document.getElementById("button2").disabled = true;
    }
  </script>
</head>
<body>
  <input id="button1" style="display:block" class="btn" type="button"
    name="firstButton" onclick="toggle1();" value="Click Me!" />
  <input id="button2" style="display:block" class="btn" type="button"
    name="secondButton" onclick="toggle2();" value="Click Me Too!" />
</body>
</html>

Thus, dynamic web application 500 may be configured to change the appearance of Button1, wherein Button1 may be initially set to display “Click Me!” and, upon being clicked, display “I'm clicked.” Button1 may be configured to toggle the display between these values upon subsequent clicks. Button2 may be configured to initially display “Click Me Too!” and, upon being clicked, become disabled. This may be represented in FIG. 5 as initiating operation in the state represented by S1. If Button1 is clicked, the dynamic web application 500 may transition to the state represented by S2. Once there, if Button1 is clicked again, the dynamic web application 500 may transition back to S1. If instead Button2 is clicked from S2, the dynamic web application 500 may transition to the state represented by S3. Similarly, clicking Button2 from S1 may cause the dynamic web application 500 to transition to the state represented by S4. The dynamic web application 500 may transition between S3 and S4 when Button1 is clicked.

Interactive client-server applications to be crawled by distributed computing system 100 may be configured to operate differently depending upon previous actions that have been taken, which may be represented as different states. In the example of dynamic web application 500, the ability to click Button2 may depend on whether Button2 was previously clicked. Such an action may not be repeatable because no means of transitioning back to the original state exists. States S3 and S4, once entered, may cause the dynamic web application 500 to not be able to return to states S1 and S2. On the other hand, the status of Button1, while also dependent upon the current state, may be toggled. Such a cycle may exist in the actions between S1 and S2, or in the actions between S3 and S4.
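The transitions described for dynamic web application 500 may be written down as a small table and checked for reachability. The sketch below is an illustration only; the dictionary layout and the name reachable are assumptions, and Button2 is omitted from S3 and S4 because it is disabled there.

```python
from typing import Dict, Set

# Transitions as described for FIG. 5: state -> {action: next state}
GRAPH: Dict[str, Dict[str, str]] = {
    "S1": {"Button1": "S2", "Button2": "S4"},
    "S2": {"Button1": "S1", "Button2": "S3"},
    "S3": {"Button1": "S4"},
    "S4": {"Button1": "S3"},
}


def reachable(start: str) -> Set[str]:
    """Return every state reachable from start, including start itself."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in GRAPH[stack.pop()].values():
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen
```

From S1 all four states are reachable, whereas from S3 only S3 and S4 are, reflecting that clicking Button2 cannot be undone.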

In operation, returning to FIG. 3, distributed computing system 100 may utilize a technique for coordinating the distributed, parallel crawling of interactive client-server applications, including dynamic web applications.

The master node 110 may take any suitable actions necessary to coordinate the crawling of a dynamic web application. In one embodiment, the master node 110 may schedule pending jobs to resources waiting to perform such jobs. In another embodiment, the master node 110 may merge results which have been received from worker nodes 112. In such an embodiment, the master node 110 may merge such results with results previously received from other worker nodes 112.

In one embodiment, the tasks of the master node 110 may be implemented using some or all of the following pseudocode:

global jobQ, resourceQ, masterSTG

procedure ScheduleJobs( ) {
  while NotEmpty(jobQ) & NotEmpty(resourceQ)
    do
      job ← GetFirst(jobQ)
      worker ← GetFirst(resourceQ)
      ScheduleJob(job, worker)
}

procedure MergeWorkerResults(compTrace, newJobs) {
  comment: Master node will first merge compTrace into the master graph
  trace ← UncompressGraph(compTrace)
  for each state in trace
    do
      if Exists(state, masterSTG) = FALSE
        then Add(state, masterSTG)
  for each transition in trace
    do
      if Exists(transition, masterSTG) = FALSE
        then Add(transition, masterSTG)
  comment: Master node will merge newJobs into existing jobs
  for each job in newJobs
    do
      if Exists(job, jobQ) = FALSE
        then Add(job, jobQ)
}

In the above pseudocode, masterSTG may represent the master screen transition graph model of the crawled application. FIG. 5, for example, may represent a completed master screen transition graph of the dynamic web application 500. Such a master screen transition graph may be stored in master state graph 236. The structure jobQ may represent the pending queue of jobs that are to be processed as part of crawling the web application under test. In one embodiment, jobQ may be implemented as a first-in first-out (FIFO) queue. The structure resourceQ may represent the pending queue of resources, such as worker nodes 112, that are to be assigned jobs. In one embodiment, resourceQ may also operate as a FIFO queue.

The master node 110 may schedule pending jobs, such as portions of a dynamic web application to be crawled, to waiting resources such as worker nodes 112. As shown above, the master node 110 may, while entries exist in both the jobQ and the resourceQ, get the first job from the top of the job queue 232, get the first resource from the resourceQ, and schedule the job to be conducted by the resource. Any suitable method may be used to get a job from the jobQ or a resource from the resourceQ. In one embodiment, the job and/or the resource that has been pending the longest may be obtained.
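The scheduling loop may be sketched with two FIFO queues as follows. This is a minimal illustration, assuming jobs and workers are represented by string identifiers and that the hypothetical function schedule_jobs returns the pairings instead of dispatching them.

```python
from collections import deque
from typing import Deque, List, Tuple


def schedule_jobs(job_q: Deque[str], resource_q: Deque[str]) -> List[Tuple[str, str]]:
    """Pair the longest-pending job with the longest-pending worker until
    either queue is empty."""
    assignments = []
    while job_q and resource_q:
        job = job_q.popleft()          # GetFirst(jobQ)
        worker = resource_q.popleft()  # GetFirst(resourceQ)
        assignments.append((job, worker))
    return assignments
```

With three pending jobs and two idle workers, the first two jobs are assigned and the third stays in the job queue until another worker registers itself.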

The master node 110 may merge worker results returned from worker nodes 112 with traces that have already been created. Each worker node 112 that synchronizes with the master node 110 may send any suitable information to the master node 110. In one embodiment, such a worker node 112 may send at least two items of data to the master node 110: a compressed trace (such as compTrace) and a new set of jobs (such as newJobs) that were encountered while the worker node 112 was crawling a portion of a dynamic web application. The master node 110 may merge such information into information being kept at the master node 110 such as jobQ, resourceQ, and the masterSTG. The master node 110 may perform such tasks in any suitable manner.

In parallel with scheduling pending jobs, the master node 110 may merge information received concerning new traces that were encountered by the worker node 112 into the master screen transition diagram. In one embodiment, the master node 110 may uncompress a returned trace that was compressed by a worker node 112. A trace may contain states and transitions between the states. The master node 110 may determine, for each state found in the returned trace, whether such a state exists in the master state diagram. If such a state does not exist, then it is added to the master state diagram. For each transition in the returned trace, the master node 110 may determine if such a transition exists in the master state diagram. If such a transition does not exist, then it is added to the master state diagram. It may be advantageous to first determine the new states, followed by the new transitions.

The master node 110 may merge information concerning new jobs that were encountered or created by the worker node 112 into the job queue 232. The master node 110 may merge such information in any suitable manner. In one embodiment, the master node 110 may determine, for each job in the newJobs that are returned to the master node 110, whether the job already exists in the jobQ. If the job does not exist in the jobQ, then it may be added to jobQ.

FIGS. 6a-6c illustrate examples of how the master node 110 may add information from the worker nodes 112 to create a master screen transition graph. FIG. 6a illustrates the case wherein an empty screen transition graph may be combined with a returned trace from a worker node 112. In the returned trace, the worker node 112 has crawled from the first state, S1, by clicking Button1 to go to the second state S2, and crawled back to state S1 by clicking Button1 again. Since no states or transitions are already present in the master screen transition graph, the combination results in the returned trace. Options not chosen, such as clicking Button2 in the state S2, may represent future jobs to be completed, which may be returned by the worker node 112 to the master node 110 and added to the job queue 232.

FIG. 6b illustrates how the master node 110 may add the results of another worker node 112 to the existing master screen transition graph resulting from the previous figure. The returned trace in FIG. 6b may be the result of a worker node 112 starting from the first state S1 and then crawling to the state S4 by clicking Button2. The worker node 112 then may have crawled to state S3 by clicking Button1, and crawled back to the state S4 by clicking Button1 a second time. Adding this returned trace to the existing master screen transition graph may cause the master node 110 to pare the returned trace's instance of S1, but otherwise represent the union of the two graphs for both states and transitions. The worker node 112 may have been the same or a different worker node 112 than that which returned a trace in FIG. 6a.

FIG. 6c illustrates how the master node 110 may add the results of yet another worker node 112 to the existing master screen transition graph resulting from the previous figure. The returned trace in FIG. 6c may be the result of a worker node 112 crawling from S1, clicking Button1 to transition to S2, and then clicking Button2 to transition to S3. Once in S3, the worker node 112 may click Button1 to crawl to S4, and click it again to return to S3. Adding this returned trace to the existing master screen transition graph may cause the master node 110 to add the transition from S2 to S3 to the master screen transition graph, as the remaining portions of the returned trace may already exist in the master screen transition graph. The worker node 112 may have been the same or a different worker node 112 than those which returned a trace in FIG. 6a and FIG. 6b. The worker node 112 may have received S2 as a starting state from the master node 110. Such an instruction may have arisen from a previously identified job added to the job queue 232, such as a worker node 112 previously exploring some operations available in S2 but not selecting Button2.
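The three merges of FIGS. 6a-6c may be replayed with a small set-based sketch. It assumes that states are identified by their labels (a real crawler would compare page content), that a trace is a list of (source, action, target) triples, and that the class name MasterGraph is purely illustrative.

```python
from typing import List, Set, Tuple

Transition = Tuple[str, str, str]  # (source state, action, target state)


class MasterGraph:
    def __init__(self) -> None:
        self.states: Set[str] = set()
        self.transitions: Set[Transition] = set()

    def merge(self, trace: List[Transition]) -> Set[Transition]:
        """Add unseen states and transitions; return the transitions that were new."""
        added: Set[Transition] = set()
        for src, action, dst in trace:
            self.states.update((src, dst))  # set semantics drop duplicate states
            if (src, action, dst) not in self.transitions:
                self.transitions.add((src, action, dst))
                added.add((src, action, dst))
        return added


master = MasterGraph()
# FIG. 6a: S1 -> S2 -> S1 via Button1
new_a = master.merge([("S1", "Button1", "S2"), ("S2", "Button1", "S1")])
# FIG. 6b: S1 -> S4 via Button2, then S4 -> S3 -> S4 via Button1
new_b = master.merge([("S1", "Button2", "S4"), ("S4", "Button1", "S3"),
                      ("S3", "Button1", "S4")])
# FIG. 6c: S1 -> S2 -> S3 -> S4 -> S3
new_c = master.merge([("S1", "Button1", "S2"), ("S2", "Button2", "S3"),
                      ("S3", "Button1", "S4"), ("S4", "Button1", "S3")])
```

Under these assumptions, the third merge contributes only the transition from S2 to S3, since every other part of that trace is already present.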

Returning to FIG. 3, distributed computing system 100 may utilize a technique for efficient partial crawling of an interactive client-server application, such as a dynamic web application, in a parallel, distributed environment. Worker nodes 112 in distributed computing system 100 may crawl portions of a dynamic web application and report the resulting discovered trace to the master node 110. Worker nodes 112 may crawl the dynamic web application in any suitable manner.

In one example, the tasks of the worker node 112 may be implemented using some or all of the following pseudocode:

procedure InitializeWorker(config)
  LoadConfig(config)
  return

procedure WorkerCrawlTrace(seedTrace)
  localStateGraph ← Null
  newJobs ← Null
  currentState ← LoadPage(initScreen)
  currentState ← ExecuteTrace(seedTrace)
  while NotVisited(currentState) & WithinResourceBound(localStateGraph)
    do
      if IsReadyToSynchronize(localStateGraph)
        then SyncWithMaster(localStateGraph, newJobs)
      actionList ← ExtractActions(currentState)
      firstAction ← GetFirstAction(actionList)
      actionList ← actionList − firstAction
      currentState ← ExecuteAction(firstAction)
      newJobs ← newJobs ∪ actionList
  SyncWithMaster(localStateGraph, newJobs)
  return

procedure SyncWithMaster(localStateGraph, newJobs)
  deltaTrace ← CompressGraph(localStateGraph)
  SendToMaster(deltaTrace, newJobs)
  MarkSentStates(localStateGraph)
  newJobs ← Null
  return

The master node 110 may use a function such as LoadConfig(config) to initialize the worker crawler application 218 on a worker node 112 such as w1 according to the configuration config, in order to prepare the worker node 112 for future crawling tasks. In one embodiment, the worker node 112 itself may initialize the worker crawler application 218 on the worker node 112. The configuration config may include any suitable information to initialize the worker node 112. In one embodiment, config may include an address, such as a URL, of the dynamic web application to be crawled. In another embodiment, config may include directives for the worker node 112 on how to crawl the dynamic web application. Such directives may include directives on target document object model (DOM) elements; for example, HTML “<a>” tags. Such directives may also include user actions to execute on the dynamic web page; for example, clicking on specific or categorical items, and/or specific user data to input at appropriate stages during crawling such as authentication data on the login page.

In one embodiment, this initialization process may utilize passing of a set of parameters, such as strings, to a pre-built crawler application previously available on the worker node 112. Such a pre-built crawler application may be implemented in worker crawler application 218. In another embodiment, this initialization process may generate new source code based on the directives in config, which may then be compiled and used to drive the crawler application on w1. Such a crawler application may operate in worker crawler application 218. The generation or compilation of new source code may be carried out on the master node 110, in an application such as master crawler application 220. Alternatively, the generation or compilation of the new source code may be done on the worker node 112.

A worker node 112 may crawl a dynamic web application starting at a designated start position. The designated start position may be implemented in an existing known trace, including states and transitions previously determined. In one embodiment, the worker node 112 may utilize the procedure WorkerCrawlTrace(seedTrace) as shown above. The parameter seedTrace may be a starting trace passed to the worker node 112 from the master node 110.

Before crawling a dynamic web application, a worker node 112 may create a local state graph, such as localStateGraph, and set it as empty. The worker node 112 may create a structure, such as newJobs, for containing new jobs that are discovered while crawling, and set it as empty. The worker node 112 may load the initial screen. The worker node 112 may use a function such as LoadPage(url) as shown above to do so, by loading a starting address such as initScreen into its worker crawler application 218 in preparation for crawling a web application corresponding to the address. In one embodiment, the address is the initial or home page of the web application to be crawled. The results of loading a starting address into the web crawler application may be stored in a structure such as currentState.

The worker node 112 may then programmatically execute a trace to reach the desired state. Such an execution may use the function ExecuteTrace(seedTrace). ExecuteTrace may in turn call a function such as ExecuteAction(action) to execute a series of actions in seedTrace. The parameter action may include one or more parameters to direct the immediate operation of the worker node 112 on a specific page. In one embodiment, action may include a pair of parameters {t, u}. The parameter t may include a target DOM element, such as a button or a link, on the current page in the browser. The parameter u may include a user action, such as a button click or a form data input, to be executed on t. ExecuteAction may programmatically execute the action specified by {t, u} on the current screen or state. In one embodiment, ExecuteAction may be operated assuming that the target element t is available on the current browser screen or state.
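Replaying a seed trace of {t, u} pairs may be sketched as follows. The sketch substitutes a lookup table for a real browser, so TOY_APP, toy_execute and the element ids are assumptions made only for illustration; execute_trace mirrors the role described for ExecuteTrace.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Action(NamedTuple):
    target: str       # t: the target DOM element, here just its id
    user_action: str  # u: the user action to perform on t


def execute_trace(seed_trace: List[Action], start: str,
                  execute_action: Callable[[str, Action], str]) -> str:
    """Replay each action in order and return the state that is reached."""
    state = start
    for action in seed_trace:
        state = execute_action(state, action)
    return state


# Toy stand-in for a browser: (state, element id) -> next state after a click.
TOY_APP: Dict[Tuple[str, str], str] = {
    ("S1", "button1"): "S2", ("S2", "button1"): "S1",
    ("S1", "button2"): "S4", ("S2", "button2"): "S3",
    ("S3", "button1"): "S4", ("S4", "button1"): "S3",
}


def toy_execute(state: str, action: Action) -> str:
    if action.user_action != "click":
        raise ValueError("this toy model only supports clicks")
    # A KeyError here corresponds to the target element not being available.
    return TOY_APP[(state, action.target)]
```

Replaying a click on button1 followed by a click on button2 from S1 ends in S3, the state from which a worker assigned that job would continue crawling.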

Thus, the worker node 112 may make an initial crawl through the dynamic web application as defined by seedTrace, or any other initial trace defined by the master node 110. Such an initial crawl may include repeating steps originally taken by other worker nodes 112. The worker node 112 may assign the results to a structure storing the current state of the crawl, such as currentState.

The worker node 112 may determine whether to continue executing the state graph or not. If so, the worker node 112 will continue to execute actions in the dynamic web application and perform related bookkeeping tasks. Otherwise, the worker node 112 will finalize the crawling of its portion of the dynamic web application and synchronize the state graph and any newly created jobs with the master node 110.

The worker node 112 may determine whether the current state has not been visited, and whether the current local state graph is operating within the defined resource bounds. While such criteria are true, the worker node 112 may conduct a sequence of events to crawl a portion of the dynamic web application. A determination about whether a state has been visited before may be made by using the NotVisited(state) function. The worker node 112 may look up a state in the localStateGraph to check if the state exists within it. If the state already exists within the localStateGraph, the worker node 112 may determine that the state has been visited before. If the state has been visited before, the NotVisited function may return false, and return true otherwise. A determination of whether the worker node 112 is operating within the bounds of the application that it has been assigned may be made through any suitable method, such as the function WithinResourceBound(localStateGraph). In such an example, the worker node 112 may determine whether the trace of the localStateGraph is within the resource bounds specified in the config with which the node was initialized, possibly using the function LoadConfig. Such bounds may be defined through any suitable metric. In one embodiment, the number of states in the trace making up the localStateGraph may be compared against a maximum threshold. In another embodiment, the depth to which crawling has been performed in the trace making up the localStateGraph may be compared against a threshold. In yet another embodiment, the time elapsed since the start of the current crawling task may be compared against a maximum threshold. In various embodiments, more than one such criterion may be combined in the resource bounds specified in the config.
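A combined resource-bound check over the three metrics mentioned above may be sketched as follows. The field names, the treatment of an unset bound as unlimited, and the optional now parameter (added so the check can be exercised deterministically) are all assumptions of this sketch.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceBounds:
    # Any bound left as None is treated as "no limit" (an assumption).
    max_states: Optional[int] = None
    max_depth: Optional[int] = None
    max_seconds: Optional[float] = None


def within_resource_bound(num_states: int, depth: int, started_at: float,
                          bounds: ResourceBounds,
                          now: Optional[float] = None) -> bool:
    """Return True while every configured bound is still respected."""
    now = time.monotonic() if now is None else now
    if bounds.max_states is not None and num_states >= bounds.max_states:
        return False
    if bounds.max_depth is not None and depth >= bounds.max_depth:
        return False
    if bounds.max_seconds is not None and now - started_at >= bounds.max_seconds:
        return False
    return True
```

Combining criteria this way means crawling stops as soon as any one configured bound is reached.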

Such a sequence may include one or more of the following steps. The worker node 112 may determine whether the local state graph, such as localStateGraph, is ready to be synchronized to the master node 110, and if so, then synchronize the localStateGraph along with any new jobs that have been created, such as those in the structure newJobs. The worker node 112 may make such a determination through any suitable method, such as using the function IsReadyToSynchronize(localStateGraph). In such a case, the worker node 112 may determine whether sufficient crawling has been performed. Such a determination may be made, for example, by measuring the number of crawled states, the depth to which crawling has been performed, or the time elapsed, since the last synchronization event caused by the worker node 112. Use of the function IsReadyToSynchronize may return true if the localStateGraph is ready to be synchronized according to the specified criteria.

From the current state of the dynamic web application, represented by currentState, the worker node 112 may extract the available actions and store them in a structure such as actionList. The worker node 112 may analyze a screen or state of the dynamic web application to determine possible actions to be taken at the screen or state. The worker node 112 may conduct such analysis through any suitable method. In one embodiment, the worker node 112 may conduct such analysis using the function ExtractActions(screen). Typically, the screen or state to be analyzed will be the currentState or the current screen in the browser. The worker node 112 may conduct the analysis based on directives specified in the config, with which the crawler was initialized, to extract a list of candidate action elements. The worker node 112 may determine possible actions to be taken and place these within a data structure such as a list.

After determining the possible actions to be taken at the screen, the worker node 112 may extract an action from the list of possible actions. The worker node 112 may use the function GetFirstAction(actionList) to accomplish this task, wherein the actionList is an ordered collection of actions that may be taken at the screen. The worker node 112 may remove the extracted action from the available actions on the current state or screen. The worker node 112 may store the action in a structure such as firstAction. The worker node 112 may execute the extracted action, and store the results of the execution in the structure for the current state or screen. The worker node 112 may combine the list of new jobs that have been encountered while crawling, in a structure such as newJobs, with the remaining actions determined from the current state or screen. In one embodiment, the worker node 112 may determine the union of the two sets of jobs, paring any duplicates. The worker node 112 may store the results in the structure for the list of new jobs.
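The crawl loop described above may be sketched against a toy model of dynamic web application 500. The table APP, the representation of a job as a (state, action) pair, the max_states bound, and the explicit recording of visited states are assumptions of this sketch and are not spelled out in the pseudocode.

```python
from typing import Dict, List, Set, Tuple

# Toy application: state -> ordered list of (action, next state)
APP: Dict[str, List[Tuple[str, str]]] = {
    "S1": [("Button1", "S2"), ("Button2", "S4")],
    "S2": [("Button1", "S1"), ("Button2", "S3")],
    "S3": [("Button1", "S4")],
    "S4": [("Button1", "S3")],
}

Job = Tuple[str, str]  # (state, action left unexplored in that state)


def crawl(start: str, max_states: int = 10) -> Tuple[List[str], Set[Job]]:
    """Follow the first action in each state and queue the rest as new jobs."""
    visited: List[str] = []
    new_jobs: Set[Job] = set()
    current = start
    while current not in visited and len(visited) < max_states:
        visited.append(current)
        actions = APP[current]                 # ExtractActions
        if not actions:
            break                              # nothing left to explore here
        first, rest = actions[0], actions[1:]  # GetFirstAction, then removal
        new_jobs |= {(current, name) for name, _ in rest}  # union pares duplicates
        current = first[1]                     # ExecuteAction
    return visited, new_jobs
```

Starting from S1, this sketch visits S1 and S2, stops on returning to S1, and reports clicking Button2 in S1 and in S2 as new jobs, which corresponds to the trace of FIG. 6a.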

If the crawling is not to continue, then the worker node 112 may synchronize with the master node 110. The worker node 112 may conduct such synchronization at this or any other suitable time. In one embodiment, the worker node 112 may use the function SyncWithMaster(localStateGraph, newJobs) to perform such synchronization. The worker node 112 may perform data transformation, accounting of resources, and send crawling results to the master node 110. Synchronizing with the master node 110 may use information such as the local state graph, and the new jobs which were discovered while crawling.

The worker node 112 may compress the local state graph. The worker node 112 may compress the local state graph through any suitable method. In one embodiment, the worker node 112 may use the CompressGraph(localStateGraph) function. The worker node 112 may use state compression algorithms to represent each state in a state graph. Such compression algorithms may represent the state graph incrementally and reduce the size of the graph. The worker node 112 may produce a compressed state graph as a result of such compression.

The worker node 112 may send information to the master node 110. Such information may include a local state graph, or a compressed or modified version of it, and a list of the new jobs that were encountered during crawling of the dynamic web application. The worker node 112 may send such information through any suitable method. In one embodiment, the worker node 112 may use the function SendToMaster(deltaTrace, newJobs) to accomplish such tasks. The worker node 112 may communicate results computed at the current worker node, since the last synchronization event, to the master node 110.

The worker node 112 may then mark portions of the local state graph as synchronized with the master node 110. The worker node 112 may perform such tasks through any suitable method. In one embodiment, the worker node 112 may use the function MarkSentStates(localStateGraph). The worker node 112 may annotate the portion of a graph such as localStateGraph so that it is not retransmitted in future synchronization events. Such markings may be used by functions such as CompressGraph or SendToMaster to determine that certain portions of the state graph do not need to be retransmitted to the master node 110.

When a state has been visited before, or if crawling the local state graph has exceeded the defined resource bounds, the worker node 112 may synchronize with the master node 110. In one embodiment, the worker node 112 may synchronize with the master node 110 using the localStateGraph, representing portions of the graph that have been generated since the last synchronization event on this node, and newJobs, containing a list of pending crawling jobs generated during the crawl and to be potentially executed in future by worker nodes 112 as assigned by the master node 110. The localStateGraph may be compressed and stored into a structure such as deltaTrace. The structure deltaTrace may contain portions of the trace of the dynamic web application that, from the worker node's perspective, may not be contained at the master node 110. The existing local state graph, such as localStateGraph, may be marked as synchronized with the master node 110. The worker node 112 may reset or empty the structure containing new jobs to be synchronized with the master node 110.
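The synchronization bookkeeping may be sketched as follows. The class LocalStateGraph, its sent flags, and the send callback standing in for network transmission are assumptions of this sketch; selecting only unsent states stands in for CompressGraph, whose actual compression is not modeled here.

```python
from typing import Callable, Dict, List, Set, Tuple

Job = Tuple[str, str]


class LocalStateGraph:
    def __init__(self) -> None:
        self.sent: Dict[str, bool] = {}  # state -> already synchronized?

    def add(self, state: str) -> None:
        self.sent.setdefault(state, False)

    def unsent(self) -> List[str]:
        return [state for state, done in self.sent.items() if not done]

    def mark_sent(self) -> None:
        for state in self.sent:
            self.sent[state] = True


def sync_with_master(graph: LocalStateGraph, new_jobs: Set[Job],
                     send: Callable[[List[str], Set[Job]], None]) -> None:
    delta_trace = graph.unsent()      # only what the master has not yet received
    send(delta_trace, set(new_jobs))  # SendToMaster
    graph.mark_sent()                 # MarkSentStates
    new_jobs.clear()                  # reset the new-job structure
```

A second synchronization after one more state is discovered transmits only that state, and no jobs are resent.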

Distributed computing system 100 may utilize a technique for compression of state information in the crawling of interactive client-server applications, including dynamic web applications. As described above, a worker node 112 may compress a state graph to reduce the information transmitted to a master node 110 during synchronization, and master node 110 may uncompress a state graph to reconstruct newly discovered states.

In one embodiment, the worker node 112 may optimize the state graph by compressing successive states or screens encountered in dynamic web applications that include only minor modifications of the previous screen. In such an embodiment, the two successive screens share much of their underlying DOM. For example, for the screen transition graph of FIG. 6, the underlying DOM representation of the initial state S1 as explained above shows the value assigned to Button1 as “Click Me!” and the value assigned to Button2 as “Click Me Too!” When Button1 is clicked on this screen, causing the transition to state S2, the only change in the underlying DOM is the change of the value attribute of element /HTML[1]/BODY[1]/INPUT[1] from “Click Me!” to “I'm clicked”. Thus, instead of the full representation, state S2 may be represented by:

<html>
<body[1]>
  <input[1] changed="attrs" id="button1" style="display:block"
    class="btn" type="button" name="firstButton"
    onclick="toggle1();" value="I'm clicked" />
</body[1]>
</html>

Thus, in one embodiment, the worker node 112 may mark and represent only those portions of a current screen of the dynamic web application (in the above example, S2) where the current screen differs from the previous or reference screen (in the above example, S1). The worker node 112 may mark and represent only those portions of a current screen which differ from the previous screen in any suitable manner. In one embodiment, the worker node 112 may accomplish these tasks through all or part of the following pseudocode:

Algorithm - CompressState(refScrn, newScrn)

global refScrn, newScrn

procedure MarkChange(node)
  if Exists(node, refScrn) & NumChild(node) ≥ NumChild(GetTwin(node, refScrn))
    then
      if not AttrsEqual(node, refScrn)
        then
          node.change ← "attrs"
          GetParent(node).childDiff ← true
      for each child in ChildNodes(node)
        do MarkChange(child)
    else
      node.change ← "tag"
      GetParent(node).childDiff ← true
  if node.childDiff = true
    then GetParent(node).childDiff ← true

main
  for each node in newScrn
    do
      node.change ← false
      node.childDiff ← false
  MarkChange(newScrn.root)
  deltaScrn ← ExtractDelta(newScrn, refScrn)
  return (deltaScrn)

The worker node 112 may compress the states or screens between a reference screen, such as refScrn, and a target screen, such as newScrn. The target screen may be a screen whose compressed representation is required. The reference screen may be any suitable screen. The reference screen may be selected based on similarity to the target screen. Thus, the screen which was visited immediately before visiting the target screen, or another predecessor screen, is a likely choice. The reference screen may provide the reference with respect to which the compression is performed. The worker node 112 may compress a given state in a state graph primarily in two phases: a marking phase and an extraction phase, discussed below.

The worker node 112 may initialize each node in the target screen, then enter the marking phase, and then enter the extraction phase, wherein the results of the marking phase are extracted and returned as the compressed state.

During initialization, the worker node 112 may compress the state of a newly crawled target screen such as newScrn, referencing a reference screen such as refScrn, by first initializing all nodes within the target screen. The worker node 112 may set markers denoting a change in the node and denoting change in children nodes to false. The worker node 112 may set two markers to be attached to each node in the DOM of a given screen or the screen in question. The first marker may represent changes made to the current node between the reference and target screens. The first marker may be designated as changed. In various embodiments, changed may have three different values: “false,” “attrs” or “tag.” The “false” value may denote that the node is the same in the target and reference screens. Such a denotation may be made in terms of a tag name, attributes, or any other suitable characteristic. The “attrs” value may denote that the node has the same tag name in the target screen as it does in the reference screen, but one or more of the attributes differ in value. The “tag” value may denote that this node has structurally different representations in the two screens. For example, such structurally different representations may include nodes with different tags at its position in the two screens, or the case where no node is present at that position in the reference screen, or the case where a node with a greater number of children is present at that position in the reference screen. The second marker may represent that one or more of the node's descendants have had their changed marker set to a non-false value, and hence the node may need to be present in the compressed representation to provide a path to the descendants that have experienced a change. The second marker may be designated as childDiff. childDiff may accept a true or false value, wherein the true value indicates that a change has happened to a descendant of the node.

Next, in the marking phase, the worker node 112 may compare the targetscreen to the reference screen, in order to identify what portions ofthe target screen differ from the reference screen and mark themaccordingly. The worker node 112 may accomplish this task through anysuitable method. In one embodiment, the worker node 112 may use thefunction MarkChange to compare the reference screen and the new screen.The worker node 112 may mark the portions of the target screen whichhave changed in reference to the reference screen. The worker node 112may begin such markings at the root of the target screen.

In marking the differences between the target screen and the referencescreen, the worker node 112 may begin with a starting node, such asnode, which may correspond to the root of the target screen. The workernode 112 may determine whether node is different than its equivalent inthe reference screen. If so, the worker node 112 may determine thatthere has been a change between the reference and target screens. Theworker node 112 may make such a determination by checking whether nodeexists in the reference screen, getting the twin of node in thereference screen, and comparing the number of children of node versusthe number of children of the twin of node in the reference screen.

In checking whether node exists in the reference screen, the worker node 112 may determine whether a node exists in the reference screen with the same xpath position and the same tag name as a particular DOM element, such as node, in the target screen. The worker node 112 may make such a determination through any suitable method. In one embodiment, the worker node 112 may make such a determination by using the Exists(node, refScrn) function as shown above. The function may return true if and only if there is a node in refScrn at the same xpath position and with the same tag name as DOM element node in newScrn.

In getting the twin of node, the worker node 112 may find and return a particular specified node in a reference screen. The worker node 112 may make such a finding through any suitable method. In one embodiment, the worker node 112 may make such a determination by using the GetTwin(node, refScrn) function as shown above. The worker node 112 may return the node corresponding to node that exists in refScrn using the xpath correspondence criterion used by Exists( ) above.

In comparing the number of children of node versus the number ofchildren of the twin of node, the worker node 112 may determine a numberof children nodes of a given node in the DOM tree of a screen or state.The worker node 112 may make such a determination through any suitablemethod. In one embodiment, the worker node 112 may make such adetermination by using the NumChild(node) function as shown above.

If a twin counterpart of node exists in the reference screen and has the same number of or fewer children than node, then the worker node 112 may determine whether the twin of node has exactly the same attributes as node. If not, the worker node 112 may change the markers of node and of its parent to reflect such a condition by setting the changed marker of node to “attrs” and by getting the parent of node and setting that parent's childDiff marker to “true.”

In getting the parent of node, the worker node 112 may determine theparent node of a specified node in the DOM tree. The worker node 112 maymake such a determination through any suitable method. In oneembodiment, the worker node 112 may make such a determination by usingthe GetParent(node) function as shown above. The function may return theparent node of node in the DOM tree.

If the attributes of the twin node are identical to those of node, then the worker node 112 may denote that node is unchanged. In addition, if Exists(node, refScrn) & NumChild(node) ≥ NumChild(GetTwin(node, refScrn)) returns true, then for each child of node, the worker node 112 may recursively process the child using the aforementioned marking scheme. In one embodiment, such a marking may be accomplished by calling MarkChange for each child found for node.

In determining children of node, the worker node 112 may determine thechildren node of the specified node in the DOM tree. The worker node 112may make such a determination through any suitable method. In oneembodiment, the worker node 112 may make such a determination by usingthe ChildNodes(node) function as shown above. The function may return anordered list of children nodes of a specified node such as node in theDOM tree.

Otherwise, if there has been a change between the reference and target screens with respect to node (possibly determined by calling Exists(node, refScrn) & NumChild(node) ≥ NumChild(GetTwin(node, refScrn)) and getting a return value of false), then the worker node 112 may denote that node is changed. In one embodiment, the worker node 112 may make such a designation by setting the changed marker of node to “tag.” Further, the worker node 112 may set a marker of the parent of node to indicate that the parent has a child that has changed. This may be accomplished by calling GetParent(node) and setting the result's childDiff parameter to “true.”

Finally, the worker node 112 may determine whether node has a child nodethat has changed, and if so, set a tag of the parent of node to indicatethat node's parent has a child who has changed. This may be accomplishedby checking the childDiff parameter of node, and then callingGetParent(node) and setting the result's childDiff parameter to “true.”
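The marking phase described above can be sketched in Python as follows. This is a minimal illustration rather than the disclosed implementation: the Node class, the path_of helper, and the use of a child-index path as a stand-in for the xpath position are all assumptions introduced for the example, and the screens are a reduced version of states S1 and S2.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)  # identity equality, so list.index finds the exact node
class Node:
    """Hypothetical DOM node carrying the two markers described above."""
    tag: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None
    changed: object = False   # False, "attrs", or "tag"
    child_diff: bool = False

    def __post_init__(self):
        for child in self.children:
            child.parent = self


def path_of(node):
    """Child-index path from the root (assumed stand-in for the xpath position)."""
    path = []
    while node.parent is not None:
        path.append(node.parent.children.index(node))
        node = node.parent
    return list(reversed(path))


def get_twin(node, ref_root):
    """Combines Exists and GetTwin: the same-position, same-tag node, or None."""
    twin = ref_root
    for index in path_of(node):
        if index >= len(twin.children):
            return None
        twin = twin.children[index]
    return twin if twin.tag == node.tag else None


def mark_change(node, ref_root):
    twin = get_twin(node, ref_root)
    if twin is not None and len(node.children) >= len(twin.children):
        if node.attrs != twin.attrs:
            node.changed = "attrs"
            if node.parent is not None:
                node.parent.child_diff = True
        for child in node.children:
            mark_change(child, ref_root)
    else:
        node.changed = "tag"
        if node.parent is not None:
            node.parent.child_diff = True
    # Propagate "a descendant changed" one level up after the recursion.
    if node.child_diff and node.parent is not None:
        node.parent.child_diff = True


def screen(button1_value):
    return Node("html", children=[
        Node("body", children=[
            Node("input", {"id": "button1", "value": button1_value}),
            Node("input", {"id": "button2", "value": "Click Me Too!"}),
        ])
    ])


ref_scrn = screen("Click Me!")
new_scrn = screen("I'm clicked")
mark_change(new_scrn, ref_scrn)
body = new_scrn.children[0]
print(body.children[0].changed, body.children[1].changed,
      body.child_diff, new_scrn.child_diff)
```

In this sketch only the first INPUT node is marked “attrs”, while BODY and HTML receive childDiff so that a path to the changed node can be retained.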

In the extraction phase, the worker node 112 may use the marking of thedifferences between the target and reference screens to extract acompressed representation of the target screen with reference to thereference screen. The worker node 112 may accomplish this task throughany suitable method. In one embodiment, the worker node 112 may use thefunction ExtractDelta to extract the compressed representation of thetarget screen. The worker node 112 may extract the differences markedbetween the target screen and the reference screen, and store theresults in a structure such as deltaScrn. The worker node 112 may returnthe resulting deltaScrn, containing the compressed target screen. Such atarget screen may be used as a compressed state to be returned to masternode 110.

FIG. 7 is an example of a marked version of the DOM tree of a screen of a dynamic web application that has been at least partially crawled. FIG. 7 may represent the effects of marking a target screen such as newScrn in reference to a reference screen such as refScrn. Such a marking may be used by the worker node 112 in the extraction phase to produce a compressed representation such as deltaScrn, by way of the function ExtractDelta. The parts of the marked DOM shown as retained or discarded illustrate an example of the compressed representation produced. For example, FIG. 7 may represent the compression of the state S2 with respect to state S1, as shown in FIG. 5. In such an example, there may be sections of the DOM tree corresponding to an HTML node 702 of the DOM tree, HTML node attributes 703, a HEAD node 704, HEAD node attributes 706, a BODY node 708, BODY node attributes 710, an INPUT node 712, DOM sub-tree 714 associated with the INPUT node 712, and various other nodes and sub-trees 716. The operation of going from state S1 to S2 may be reflected as a change in a DOM node such as INPUT node 712, its attributes, and in the sub-tree of its descendant nodes 714. In addition, there may have been a change exclusively to the attributes 706 of the HEAD node. This may be the result of clicking the “Click Me!” button, wherein portions of the script are activated and changes to the button values are made. These portions of the marked DOM model may be marked as changed, and thus included in a compressed version of the DOM model to be returned. Meanwhile, many other portions 716, 718 of the DOM model may remain unchanged between the two states S1 and S2. Thus, these portions may be marked as unchanged, and thus removed in the compressed version of the DOM model to be returned. Some sections, such as the HTML node 702, HEAD node 704, and BODY node 708 may remain unchanged between the two states S1 and S2, but may have children that did change. Thus, these sections may be retained in the compressed version of the DOM model to be returned so as to provide a path to the portions that did change.

Thus, the worker node 112 may return the portions of FIG. 7 marked asretained as a compressed representation, such as deltaScrn. Such acompressed representation may have sufficient information to uniquelyand completely reconstruct the original representation newScrn fromdeltaScrn and refScrn.
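The disclosure names ExtractDelta without giving its body, so the following Python sketch is only one plausible reading of the extraction phase. It assumes marked nodes are plain dictionaries, that a node is retained when its changed marker is non-false or its childDiff marker is true, that a “tag” node keeps its whole sub-tree, and that a 1-based sibling index is recorded so positions survive pruning. The function name extract_delta and the helper n are hypothetical.

```python
import copy


def extract_delta(node, index=1):
    """Return a pruned copy of a marked tree, or None if nothing changed."""
    changed, child_diff = node["changed"], node["child_diff"]
    if not changed and not child_diff:
        return None  # identical to the reference screen: discard
    delta = {"tag": node["tag"], "index": index, "changed": changed}
    if changed:
        delta["attrs"] = dict(node["attrs"])
    if changed == "tag":
        # Structural change: keep the whole sub-tree verbatim.
        delta["children"] = [copy.deepcopy(c) for c in node["children"]]
    else:
        kept = (extract_delta(c, i)
                for i, c in enumerate(node["children"], 1))
        delta["children"] = [d for d in kept if d is not None]
    return delta


def n(tag, attrs=None, children=(), changed=False, child_diff=False):
    """Build a marked node as a dictionary (illustrative helper)."""
    return {"tag": tag, "attrs": attrs or {}, "children": list(children),
            "changed": changed, "child_diff": child_diff}


marked = n("html", child_diff=True, children=[
    n("head"),
    n("body", child_diff=True, children=[
        n("input", {"id": "button1", "value": "I'm clicked"}, changed="attrs"),
        n("input", {"id": "button2", "value": "Click Me Too!"}),
    ]),
])
delta_scrn = extract_delta(marked)
print(delta_scrn)
```

Here the unchanged HEAD node and the second INPUT node are dropped, while HTML and BODY are kept only as a path to the changed INPUT node.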

FIGS. 8a and 8b are an example embodiment of a method 800 forcoordinating the distributed, parallel crawling of interactiveclient-server applications such as dynamic web applications. Thepseudocode described above in the operation of distributed computingsystem 100 may implement some or all of method 800.

In step 805, a web application may be initialized for crawling. Such initialization may include determining one or more initial jobs, representing starting positions or initial traces for crawling the web application. In one embodiment, the number of initial jobs created may be greater than the number of resources available to execute such jobs in parallel. In step 810, any such determined jobs may be added to a job queue.

Two or more branches of method 800 may execute in parallel. One suchbranch may begin with step 815. Another such branch may begin with step850. Each branch may execute until the method is terminated. Adetermination of whether the method should be terminated may happen ineither branch, or in another branch of execution of method 800. In oneembodiment, such a determination may be made in the branch beginningwith step 815.

In step 815, it may be determined whether the job queue and the resourcequeue contain entries. Step 815 may be implemented in a polling scheme,event handler, or any other suitable mechanism. If the job queue andresource queue contain entries, then in step 820, a job may be selectedfrom the job queue. Any suitable method of selecting a job may be used.In one embodiment, a job may be selected on a first-in first-out basis.In step 825, a resource may be selected from the resource queue. Anysuitable method of selecting a resource may be used. In one embodiment,a resource may be selected on a first-in first-out basis. In step 830,the job may be assigned to be executed by the resource. Such anassignment may include the resource crawling a portion of the webapplication designated by the job. In step 835, the resource may beinitialized for execution of the job. Next, the method 800 may return tostep 815.

If either the job queue or the resource queue does not contain entries, then it may be determined whether the method should be terminated. In step 840, it may be determined whether the job queue is empty and whether all jobs have been executed. If so, in step 845 such a case may reflect that the web application has been completely crawled, and the method may exit. If not, then the method may return to step 815.
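The dispatch branch of steps 815 through 845 might be sketched in Python as below. The function name dispatch, the running dictionary used to track executing jobs, and the assign callback are assumptions made for illustration; the first-in first-out selection follows the embodiment described above.

```python
from collections import deque


def dispatch(job_queue, resource_queue, running, assign):
    """Pair queued jobs with idle resources in FIFO order (steps 815-835).

    Returns True when the job queue is empty and no job is still
    executing, the termination condition of steps 840-845.
    """
    while job_queue and resource_queue:
        job = job_queue.popleft()            # step 820
        resource = resource_queue.popleft()  # step 825
        running[job] = resource              # step 830
        assign(resource, job)                # step 835
    return not job_queue and not running


jobs = deque(["job1", "job2", "job3"])
resources = deque(["worker1", "worker2"])
running = {}
sent = []
done = dispatch(jobs, resources, running,
                lambda resource, job: sent.append((resource, job)))
print(sent, list(jobs), done)
```

With three jobs and two resources, the third job remains queued until a worker re-registers in the resource queue, so the termination check reports that crawling is not yet complete.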

In step 850, it may be determined whether results have been received from any jobs that were previously assigned to resources. Step 850 may be implemented in a polling scheme, event handler, or any other suitable mechanism. If results have not been received, then the method 800 may return to step 850. If results have been received, then in step 855 any state graphs received as part of the results may be uncompressed. For each state in a received state graph, in step 860 it may be determined whether the state is in the master state graph. If not, in step 865 the state may be stored in the master state graph and the method 800 may move to step 870. If so, the method 800 may move to step 870. For each transition in the received state graph, in step 870 it may be determined whether the transition is in the master state graph. If not, in step 875 the transition may be added to the master state graph and the method 800 may move to step 880. If so, the method 800 may move to step 880. For each job in the received results, in step 880 it may be determined whether the job is in the job queue or currently executing in a resource. If not, then in step 885 the job may be added to the job queue and the method 800 may return to step 850. If so, then the method 800 may return to step 850.
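A minimal Python sketch of the integration branch of steps 860 through 885 follows. The result layout (a dictionary with states, transitions, and new_jobs keys), the tuple form of a transition, and the function name integrate_results are assumptions for illustration. The decompression of step 855 is omitted, so the received state graph is taken to be already uncompressed.

```python
from collections import deque


def integrate_results(master_states, master_transitions,
                      job_queue, running, result):
    """Merge one worker result into the master state graph."""
    for state_id, state in result["states"].items():
        if state_id not in master_states:             # step 860
            master_states[state_id] = state           # step 865
    for transition in result["transitions"]:
        if transition not in master_transitions:      # step 870
            master_transitions.add(transition)        # step 875
    for job in result["new_jobs"]:
        if job not in job_queue and job not in running:  # step 880
            job_queue.append(job)                     # step 885


master_states = {"S1": "<dom of S1>"}
master_transitions = set()
job_queue = deque([("click2",)])
running = {("click3",): "worker2"}
result = {
    "states": {"S1": "<dom of S1>", "S2": "<dom of S2>"},
    "transitions": [("S1", "click1", "S2")],
    "new_jobs": [("click2",), ("click3",), ("click1", "click2")],
}
integrate_results(master_states, master_transitions,
                  job_queue, running, result)
print(sorted(master_states), master_transitions, list(job_queue))
```

Only the job that is neither queued nor executing is appended, which keeps the same portion of the application from being crawled twice.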

FIG. 9 is an example embodiment of a method 900 for efficient partialcrawling of interactive client-server applications such as dynamic webapplications in a parallel, distributed environment. The pseudocodedescribed above in the operation of distributed computing system 100 mayimplement some or all of method 900.

In step 905, the execution of a job may be initialized. The job mayrepresent a portion of a web application to be crawled. Suchinitialization may include creating an empty state graph, wherein thestate graph may contain the results of crawling the web application. Arecord for containing new jobs discovered while crawling the webapplication may be initialized. An initial trace may be executed toarrive at a designated starting place in the web application. A screenof the web application at such a designated starting place may beloaded. In step 910, such a screen may be designated as a current state.

In step 915, it may be determined whether the current state has been previously visited, according to the local graph. If so, the crawling of the job may be ended and the method 900 may move to step 975. If not, then in step 920 it may be determined whether execution of the job is within defined bounds. Any suitable method of determining whether execution of the job is within defined bounds may be used. If not, then crawling of the job may be ended and the method 900 may move to step 975. If so, then in step 930 it may be determined whether the state graph is ready to be synchronized. Such a determination may cause the state graph to be synchronized on a periodic basis. If so, then in step 932 the state graph may be synchronized with a master state graph, and the method may move to step 935. If not, then the method may move to step 935.

In step 935, crawling of the web application may happen by first determining the possible actions available at the current state. In one embodiment, such actions may be based upon information contained within the DOM of the state. In step 940, the possible actions may be added to a list of unperformed actions. In step 945, an action to be performed may be selected from the list of unperformed actions. Any suitable basis, crawling technique, or search strategy may be used to select which action should be performed. The selected action may be removed from the unperformed action list in step 950, and then executed in step 955. In step 960, the result of executing step 955 may be designated as the new current state. In step 965, one or more jobs may be created from the list of unperformed actions, and in step 970 the new jobs may be added to a list of new jobs. Such a list of new jobs may be transmitted during synchronization to a job queue for future execution by a resource. The method 900 may then return to step 915.
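The worker-side loop of steps 905 through 970 can be illustrated with the Python sketch below. Several simplifications are assumed: the application is modeled as a dictionary mapping each state to its available actions and resulting states, a job is represented by the trace of actions leading to its starting place, the defined bound is a maximum number of actions, the first action in sorted order is always selected, and the periodic synchronization of steps 930 and 932 is left out. The name crawl_job is hypothetical.

```python
def crawl_job(app, initial_trace, start_state, max_depth):
    """Crawl one job and return (states, transitions, new_jobs)."""
    current = start_state
    for action in initial_trace:          # step 905: replay the initial trace
        current = app[current][action]
    trace = list(initial_trace)
    states, visited = {current}, set()
    transitions, new_jobs = [], []
    depth = 0
    # Steps 915 and 920: stop on a revisited state or an exceeded bound.
    while current not in visited and depth < max_depth:
        visited.add(current)
        unperformed = sorted(app[current])        # steps 935-940
        if not unperformed:
            break
        action = unperformed.pop(0)               # steps 945-950
        next_state = app[current][action]         # step 955
        transitions.append((current, action, next_state))
        # Steps 965-970: each remaining action becomes a job for later.
        new_jobs.extend(tuple(trace) + (other,) for other in unperformed)
        trace.append(action)
        current = next_state                      # step 960
        states.add(current)
        depth += 1
    return states, transitions, new_jobs


toy_app = {
    "S1": {"click1": "S2", "click2": "S3"},
    "S2": {"click1": "S1"},
    "S3": {},
}
states, transitions, new_jobs = crawl_job(toy_app, (), "S1", 10)
print(sorted(states), transitions, new_jobs)
```

In this toy run the worker follows click1 from S1 to S2 and back, stops on revisiting S1, and reports the unexplored click2 action as a new job that another worker could execute in parallel.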

In step 975, the state graph may be synchronized with the master state graph. This step may be implemented in the same manner as step 932. Other information regarding the execution of the job may be transmitted to a master node. In step 980, an indication of availability of the current worker node 112 may be registered in a resource queue.

FIG. 10 is an example embodiment of a method 1000 for synchronizing astate graph created from crawling a portion of an interactiveclient-server application with a master state graph of the application.In some embodiments, method 1000 may implement some or all of steps 932and 975 of FIG. 9. The pseudocode described above in the operation ofdistributed computing system 100 may implement some or all of method1000.

In step 1005, a state graph to be synchronized with a master state graphmay be compressed. Each state within the graph may be compressed usingany suitable technique, including those discussed herein. The stategraph may contain information from executing a job, the job indicating aportion of a web application to be crawled. In step 1010, the result ofsuch a compression may be stored. The result may represent thedifference between the state graph and a previous state graph that wasalready synchronized. In step 1015, the compressed state graph and/or alist of new jobs may be sent to a master node, which may control themaster state graph and may be configured to merge the two. In step 1020,the state graph may be marked as synchronized with the master node. Suchmarkings may be used by a future instance of method 1000 during step1010. In step 1025, the list of new jobs may be cleared.
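One way to picture the incremental synchronization of steps 1010 through 1025 is the Python sketch below. The LocalGraph class, its field names, and the send callback standing in for transmission to the master node are assumptions for illustration, and the per-state compression of step 1005 is omitted.

```python
class LocalGraph:
    """Hypothetical worker-side state graph that tracks what was already sent."""

    def __init__(self):
        self.states = {}
        self.transitions = set()
        self.new_jobs = []
        self._synced_states = set()
        self._synced_transitions = set()

    def synchronize(self, send):
        # Step 1010: keep only the difference from the last synchronized graph.
        delta_states = {key: value for key, value in self.states.items()
                        if key not in self._synced_states}
        delta_transitions = self.transitions - self._synced_transitions
        # Step 1015: send the difference and any new jobs to the master node.
        send({"states": delta_states,
              "transitions": delta_transitions,
              "new_jobs": list(self.new_jobs)})
        # Step 1020: mark what was sent as synchronized.
        self._synced_states.update(delta_states)
        self._synced_transitions |= delta_transitions
        # Step 1025: clear the list of new jobs.
        self.new_jobs.clear()


outbox = []
graph = LocalGraph()
graph.states["S1"] = "<dom of S1>"
graph.new_jobs.append(("click2",))
graph.synchronize(outbox.append)

graph.states["S2"] = "<dom of S2>"
graph.transitions.add(("S1", "click1", "S2"))
graph.synchronize(outbox.append)
print(outbox[1])
```

The second message carries only state S2 and the new transition, since S1 and the earlier job list were already delivered by the first synchronization.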

FIG. 11 is an example embodiment of a method 1100 for compression ofstate information in the crawling of interactive client-serverapplications such as dynamic web applications. The pseudocode describedabove in the operation of distributed computing system 100 may implementsome or all of method 1100.

In step 1105, an application may be crawled to create a state graph. The state graph may represent the operation of the application. Alternatively, a state graph may be received or otherwise determined. For each state in the state graph, steps 1115-1145 may be conducted.

In step 1115, a screen associated with the given state may bedetermined. The following steps may attempt to compress such a screen.In step 1120, a model of the screen may be determined. In oneembodiment, such a model may include a DOM model. In step 1125, areference screen for the screen may be determined. Such a referencescreen may include a previous screen, on which an action was taken thatled to the given screen.

The given screen may contain one or more nodes as part of its model. Foreach such node, in step 1130 the node may be initialized. Suchinitialization may include setting indications that the node isunchanged. Upon finding a change in the node in comparison to thereference screen, such indications may be subsequently changed.

In step 1135, differences between the screen and the reference screenmay be marked. Such differences may be marked starting at the root nodeof the screen.

In step 1140, such marked changes between the screen and the referencescreen may be extracted. Such extracted, marked changes may be stored asa compressed version of the given state. In step 1145, the compressedstate may be returned.

FIG. 12 is an example embodiment of a method 1200 for marking thechanges between a screen and a reference screen. The pseudocodedescribed above in the operation of distributed computing system 100 mayimplement some or all of method 1200. In some embodiments, some or allof step 1135 of FIG. 11 may be implemented by method 1200.

In step 1205, a starting node in the model of the screen to be markedmay be determined. Such a starting node may be a root node of the screento be marked, or another node as designated by the entity invoking themethod 1200. Similarly, in step 1210 a reference screen may bedetermined. Such a reference screen may be designated by the entityinvoking the method 1200.

In step 1215, it may be determined whether the node exists in thereference screen. If so, then the node's children might be explored todetermine any changes between such children and the reference screen. Ifnot, then the node's children might not be explored to determine anychanges between such children and the reference screen.

If the node exists in the reference screen, then in step 1220 the twin of the node in the reference screen may be obtained. In step 1225, the number of children of the twin node may be determined, and in step 1230 the number of children of the present node may be determined.

In step 1235 it may be determined whether the present node has an equal or greater number of children than the twin. If so, then in step 1240 it may be determined whether or not the attributes of the node and the twin node are equal. Such attributes may be a part of a DOM model. If the attributes are not equal, then in step 1245 the node may be marked as changed. In one embodiment, indicators concerning the node attributes may be marked as changed. In step 1247, a parent of the node may be determined, and an indicator on such a parent node may be marked to show that the parent has a changed child node. In step 1250, for each child of the present node, the method 1200 may be called recursively. If the attributes of the node and the twin node are equal, then the method 1200 may similarly move to step 1250. After the recursive calls to the children nodes have been made, the method 1200 may proceed to step 1265.

If the present node does not have an equal or greater number of childrenthan the twin node, then the method may proceed to step 1255, whereinthe node is marked as changed. In step 1260, a parent of the node may bedetermined, and an indicator on such a parent node may be marked to showthat the parent has a changed child node. Step 1260 and step 1247 may beimplemented in the same fashion. The method 1200 may then proceed tostep 1265.

In step 1265, it may be determined whether the node has any changed child nodes. Such a determination may be made by examining the indications of the node for such a designation. The node may have been marked as such through the recursive call of method 1200 for children of the node, which, during the operation of method 1200, may have marked the node as having a changed child. If the node has any changed child nodes, then in step 1270 a parent of the node may be determined, and an indicator on such a parent node may be marked to show that the parent has a changed child node. Steps 1270, 1260, and 1247 may be implemented in the same fashion. Method 1200 may then proceed to step 1275, wherein the method 1200 may exit.

Although FIGS. 8-12 disclose a particular number of steps to be taken with respect to example methods 800, 900, 1000, 1100, and 1200, methods 800, 900, 1000, 1100, and 1200 may be executed with more or fewer steps than those depicted in FIGS. 8-12. In addition, although FIGS. 8-12 disclose a certain order of steps to be taken with respect to methods 800, 900, 1000, 1100, and 1200, the steps comprising methods 800, 900, 1000, 1100, and 1200 may be completed in any suitable order.

Methods 800, 900, 1000, 1100, and 1200 may be implemented using thesystem of FIGS. 1-7, or any other system, network, or device operable toimplement methods 800, 900, 1000, 1100, and 1200. In certainembodiments, methods 800, 900, 1000, 1100, and 1200 may be implementedpartially or fully in software embodied in computer-readable media.

For the purposes of this disclosure, computer-readable media may include any instrumentality or aggregation of instrumentalities that may retain data and/or instructions for a period of time. Computer-readable media may include, without limitation, storage media such as a direct access storage device (e.g., a hard disk drive or floppy disk), a sequential access storage device (e.g., a tape disk drive), compact disk, CD-ROM, DVD, random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), and/or flash memory; as well as communications media such as wires, optical fibers, and other tangible, non-transitory media; and/or any combination of the foregoing.

Although the present disclosure has been described in detail, it shouldbe understood that various changes, substitutions, and alterations canbe made hereto without departing from the spirit and the scope of thedisclosure.

What is claimed is:
 1. A distributed computing system, comprising: afirst worker node configured to execute a first job, the first jobindicating a first portion of an interactive client-server applicationto be crawled; a second worker node configured to execute a second job,wherein: the second job indicates a second portion of the interactiveclient-server application to be crawled; the second worker node andfirst worker node are configured to execute their respective jobs inparallel and to crawl the interactive client-server application througha single address; and the interactive client-server application isinitially and subsequently accessed by both the first worker node andthe second worker node through the single address and crawled throughsuccessive screens of the single address, wherein the successive screensreflect a partial change in content, the first worker node and secondworker node configured to identify such a partial change in contentbetween successive screens to produce crawling results; and a masternode comprising a processor coupled to a memory, the master nodeconfigured to: assign the first job to the first worker node; assign thesecond job to the second worker node; integrate the crawling resultsfrom the first worker node and the second worker node into a record ofoperation of the application.
 2. The distributed computing system ofclaim 1 wherein the application to be crawled comprises a dynamic webapplication to be crawled.
 3. The distributed computing system of claim1 wherein: the memory comprises a job data structure, the job datastructure containing indications of a plurality of jobs; and the masternode is configured to select the first job and the second job from thejob data structure.
 4. The distributed computing system of claim 1further comprising a resource data structure, the resource datastructure containing indications of a plurality of worker nodesavailable to execute job, wherein the master node is configured toselect the first worker node and the second worker node from theresource data structure.
 5. The distributed computing system of claim 3,wherein: the results returned from the first worker node contains newjobs discovered while crawling the first job; and the master node isconfigured to integrate the new jobs returned from the first worker nodeinto the job data structure.
 6. The distributed computing system ofclaim 1 wherein the master node is configured to purge duplicate jobs,wherein the duplicate jobs comprise jobs for which results have beenreceived or jobs which are being executed on worker nodes.
 7. Thedistributed computing system of claim 6, wherein the duplicate job ispurged from a third worker node.
 8. The distributed computing system ofclaim 6, wherein: the memory comprises a job data structure, the jobdata structure containing indications of a plurality of jobs; and theduplicate job is purged from the job data structure.
 9. The distributedcomputing system of claim 1 wherein the first worker node and the secondworker node comprise a partial representation of the record of operationof the application.
 10. A method of verifying an interactiveclient-server application, comprising: selecting and assigning a firstjob indicating a first portion of an interactive client-serverapplication to be crawled; selecting and assigning a second jobindicating a second portion of the application to be crawled; executingthe first job and executing the second job in parallel, wherein:executing the first job and executing the second job includes crawlingthe interactive client-server application through a single address; andthe interactive client-server application is initially and subsequentlyaccessed during execution of both the first job and the second jobthrough the single address and crawled through successive screens of thesingle address, wherein the successive screens reflect a partial changein content, the first worker node and second worker node configured toidentify such a partial change in content between successive screens toproduce crawling results; and integrating partial results from the firstjob and the second job into a record of operation of the interactiveclient-server application.
 11. The method of claim 10 wherein theapplication to be crawled comprises a dynamic web application to becrawled.
 12. The method of claim 10 further comprising selecting thefirst job and the second job from a job data structure containingindications of a plurality of jobs to be executed to crawl theinteractive client-server application.
 13. The method of claim 12, further comprising: returning new jobs discovered while crawling the first job; and integrating the new jobs into the job data structure.
 14. The method of claim 10 further comprising purging duplicate jobs, wherein the duplicate jobs comprise jobs for which results have been found or jobs which are being executed on worker nodes.
 15. The methodof claim 10 wherein execution of the first job and the second job isconducted with a partial representation of the record of operation ofthe interactive client-server application.
 16. An article of manufacturecomprising: a non-transitory computer readable medium; andcomputer-executable instructions carried on the non-transitory computerreadable medium, the instructions readable by a processor, theinstructions, when read and executed, for causing the processor to:select and assign a first job indicating a first portion of aninteractive client-server application to be crawled; select and assign asecond job indicating a second portion of the interactive client-serverapplication to be crawled; execute the first job and executing thesecond job in parallel, wherein: execution of the first job andexecution of the second job includes crawling the interactiveclient-server through a single address; the interactive client-serverapplication is initially and subsequently accessed during execution ofboth the first job and the second job through the single address andcrawled through successive screens of the single address, wherein thesuccessive screens reflect a partial change in content, the first workernode and second worker node configured to identify such a partial changein content between successive screens to produce crawling results; andintegrate partial results from the first job and the second job into arecord of operation of the application.
 17. The article of claim 16wherein the interactive client-server application to be crawledcomprises a dynamic web application to be crawled.
 18. The article ofclaim 16 wherein the processor is further caused to select the first joband the second job from a job data structure containing indications of aplurality of jobs to be executed to crawl the application.
 19. Thearticle of claim 18, wherein the processor is further caused to: returnnew jobs discovered while crawling the first job; and integrate the newjobs into the job data structure.
 20. The article of claim 16 whereinthe processor is further caused to purge duplicate jobs, wherein theduplicate jobs comprise jobs for which results have been found or jobswhich are being executed on worker nodes.
 21. The article of claim 16wherein execution of the first job and the second job is conducted witha partial representation of the record of operation of the application.